Neural network speech processing for toys and consumer electronics

The ongoing challenge in speech research is recognizing continuous, unconstrained speech. In comparison, isolated-word recognition with small vocabularies is easy. Many commercial efforts are aimed at the high-end problem. Sensory, Inc. has successfully focused on the low end, producing a family of low-cost speech recognition chips for toys, consumer electronics, electronic learning aids, and home appliances. The chips are based on a 4 MIPS 8-bit microcontroller with on-board AGC, A/D, D/A, and digital filtering. The microcontroller can be programmed for speaker-independent or speaker-dependent recognition, voice verification (recognizing a stored password spoken by a particular speaker), polyphonic music synthesis, speech synthesis, and voice record and playback, and it has enough power to drive and communicate with the application product. The speech recognition and voice verification products are neural-network based. Speaker-independent recognition of vocabularies of up to 10 words achieves accuracies of 95-98%. Speaker-dependent recognition of vocabularies of up to 60 items has an accuracy greater than 99%. The chip can be programmed to handle larger vocabularies by context-dependent switching of recognition sets. The neural net architectures are fairly standard, but the hardware and real-world usage impose some interesting challenges, including speed constraints that necessitate integer arithmetic, very limited RAM, and on-line speaker adaptation.
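
To make the integer-arithmetic constraint concrete, the following is a minimal sketch of a feed-forward network evaluated with integer operations only. It is not the network used on the chip: the abstract does not describe the number format, layer sizes, or activation function, so the Q7 fixed-point format (an 8-bit value v stands for v/128), the clipped-linear activation, and all function names here (saturate, dense_int8, relu_int8, classify) are assumptions chosen for illustration.

```python
# Illustrative sketch only: integer-only inference in an assumed Q7 format.
# None of these names or choices come from the paper.

def saturate(value, lo=-128, hi=127):
    """Clamp an integer to the signed 8-bit range."""
    return max(lo, min(hi, value))


def dense_int8(inputs, weights, biases, shift=7):
    """One fully connected layer using only integer multiply, add, and shift.

    inputs  : list of Q7 integers in [-128, 127]
    weights : list of rows, one row of Q7 integers per output unit
    biases  : list of integers in Q14 (the scale of a Q7 * Q7 product)
    shift   : right shift that brings the Q14 accumulator back to Q7
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias  # accumulator is wider than 8 bits
        for x, w in zip(inputs, row):
            acc += x * w
        # Arithmetic right shift rescales; saturation avoids wrap-around.
        outputs.append(saturate(acc >> shift))
    return outputs


def relu_int8(values):
    """Assumed activation: clip negative values to zero."""
    return [max(0, v) for v in values]


def classify(features, layers):
    """Run all layers and return the index of the largest output unit."""
    activations = features
    for i, (weights, biases) in enumerate(layers):
        activations = dense_int8(activations, weights, biases)
        if i < len(layers) - 1:
            activations = relu_int8(activations)
    return max(range(len(activations)), key=activations.__getitem__)
```

Under these assumptions, each word in the active recognition set corresponds to one output unit, and context-dependent switching of recognition sets could be modeled as passing a different layers list to classify depending on the application state.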
